UTF8 strings lowercase conversion #26

Merged
merged 1 commit into from Mar 14, 2014

Conversation

p-lambert
Contributor

Currently, process_text converts the given string to lowercase in order to perform the word matching, but unfortunately Ruby's String#downcase does not convert non-ASCII characters in UTF-8 strings.
I've noticed inconsistencies like this one:

require 'whatlanguage'
puts "ÂNCORA COR ÂMBAR".language # => spanish
puts "âncora cor âmbar".language # => portuguese

Thanks for the library!
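
A small way to reproduce the underlying casing behaviour outside the library, offered only as a sketch. It assumes Ruby 2.4 or later, where `String#downcase` accepts an `:ascii` option that limits case mapping to A-Z, which mimics the ASCII-only behaviour discussed here:

```ruby
# Assumes Ruby 2.4+ and a UTF-8 source file (the default encoding).
upper = "ÂNCORA COR ÂMBAR"

# ASCII-only mapping: the accented capitals are left untouched,
# so the words no longer match lowercase word lists.
ascii_only = upper.downcase(:ascii)

# Full Unicode mapping (the default on Ruby 2.4+).
full = upper.downcase

puts ascii_only # => Âncora cor Âmbar
puts full       # => âncora cor âmbar
```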

@peterc
Owner

peterc commented Mar 7, 2014

This is a general problem in Ruby. Do you know of any reasonable solutions?

The problem here is that WhatLanguage in its current form depends on word lists, and having all combinations of casing in those lists is impractical, so we have to normalize the input somehow. Is there a better way to do this normalization?

@p-lambert
Contributor Author

I did some research on that and there is no simple solution (like 1-to-1 mappings covering all scenarios), since several conditions have to be taken into account, and mostly because some of them are locale-dependent (see, for example, Character Properties, Case Mappings & Names FAQ).

Thus we get stuck in a circular problem: we need to normalize the string in order to identify the language and ideally the language must be taken into account in this process of normalization.

Although this seems rather disappointing, I really believe results would be greatly improved if we at least performed those simple conversions (i.e., the case folding as specified by Unicode), even disregarding the locale-dependent rules.

Of course this casing conversion goes beyond the scope of this library, so I would propose using an external one. https://github.com/lang/unicode_utils seems to do the trick and appears to be well written, using official specifications from Unicode. We could dynamically define a to_lowercase method which would either delegate the conversion to UnicodeUtils, if it is defined, or simply fall back to String#downcase. That way the user could optionally require the aforementioned library and it would not be a hard dependency. Does this sound too ugly?
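
The optional delegation proposed above could be sketched roughly as follows. This is only an illustration, not necessarily the code merged in this pull request: the module name `CaseSketch` is hypothetical, and `UnicodeUtils.downcase` is assumed to be available only when the user has required the unicode_utils gem themselves.

```ruby
# Hypothetical sketch of the optional-dependency idea.
module CaseSketch
  # Lowercases text with UnicodeUtils when that constant is defined,
  # otherwise falls back to Ruby's built-in String#downcase.
  def self.to_lowercase(text)
    if defined?(UnicodeUtils)
      UnicodeUtils.downcase(text)
    else
      text.downcase
    end
  end
end

puts CaseSketch.to_lowercase("HELLO WORLD")
```

Checking `defined?(UnicodeUtils)` at call time rather than at load time means the user may require the gem either before or after loading the library.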

@peterc
Owner

peterc commented Mar 7, 2014

I concur. The plan for the next version of WhatLanguage mitigates this somewhat as it will include using histograms of Unicode codepoint usage, but this approach may still be useful.

I think your suggestion in the last paragraph makes sense. Do you want to have a quick attempt at it or would you prefer me to look at it?

@p-lambert
Contributor Author

I'll try something! Thanks

@p-lambert
Contributor Author

@peterc, any comments on that?

@peterc
Owner

peterc commented Mar 14, 2014

I think it's a nice, gentle, mostly hands-off approach that could work for now, so thanks! I'll merge it in :-)

peterc added a commit that referenced this pull request Mar 14, 2014
UTF8 strings lowercase conversion
@peterc peterc merged commit 4b8212e into peterc:master Mar 14, 2014